Non-autoregressive End-to-end Approaches for Joint Automatic Speech Recognition and Spoken Language Understanding
This paper presents the use of non-autoregressive (NAR) approaches for joint
automatic speech recognition (ASR) and spoken language understanding (SLU)
tasks. The proposed NAR systems employ a Conformer encoder that applies
connectionist temporal classification (CTC) to transcribe the speech utterance
into raw ASR hypotheses, which are further refined with a bidirectional encoder
representations from Transformers (BERT)-like decoder. In the meantime, the
intent and slot labels of the utterance are predicted simultaneously using the
same decoder. Both Mask-CTC and self-conditioned CTC (SC-CTC) approaches are
explored for this study. Experiments conducted on the SLURP dataset show that
the proposed SC-Mask-CTC NAR system achieves 3.7% and 3.2% absolute gains in
SLU metrics and a competitive level of ASR accuracy when compared to a
Conformer-Transformer based autoregressive (AR) model. Additionally, the NAR
systems achieve 6x faster decoding speed than the AR baseline.
Comment: 8 pages, 1 figure, accepted at IEEE SLT202
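The Mask-CTC refinement idea can be illustrated with a toy sketch: low-confidence tokens from the CTC first pass are replaced by a mask symbol and then iteratively re-predicted by a bidirectional decoder. Everything here is illustrative (the decoder is a stub, and the confidence threshold and fill schedule are assumptions, not the paper's settings):

```python
# Toy sketch of Mask-CTC-style refinement (illustrative, not the paper's code).
MASK = "<mask>"

def mask_low_confidence(tokens, confidences, threshold=0.9):
    """Replace CTC hypothesis tokens whose confidence falls below threshold."""
    return [t if c >= threshold else MASK for t, c in zip(tokens, confidences)]

def refine(tokens, predict_fn, iterations=2):
    """Iteratively fill masked positions using a decoder stand-in.
    Mask-CTC fills the easiest positions first; this stub just fills
    a subset of masked slots per iteration."""
    tokens = list(tokens)
    for _ in range(iterations):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        for i in masked[: max(1, len(masked) // 2)]:
            tokens[i] = predict_fn(tokens, i)
    return tokens

# Stub decoder that always predicts a fixed token, purely for illustration:
hyp = mask_low_confidence(["turn", "on", "the", "ligt"], [0.99, 0.98, 0.95, 0.4])
print(hyp)                                       # ['turn', 'on', 'the', '<mask>']
print(refine(hyp, lambda toks, i: "light"))      # ['turn', 'on', 'the', 'light']
```

In the actual system the stubbed `predict_fn` would be the BERT-like decoder conditioning on the whole (partially masked) hypothesis, which is what makes the refinement non-autoregressive.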
On End-to-end Multi-channel Time Domain Speech Separation in Reverberant Environments
This paper introduces a new method for multi-channel time domain speech
separation in reverberant environments. A fully-convolutional neural network
structure has been used to directly separate speech from multiple microphone
recordings, with no need for conventional spatial feature extraction. To reduce
the influence of reverberation on spatial feature extraction, a dereverberation
pre-processing method has been applied to further improve the separation
performance. A spatialized version of wsj0-2mix dataset has been simulated to
evaluate the proposed system. Both source separation and speech recognition
performance of the separated signals have been evaluated objectively.
Experiments show that the proposed fully-convolutional network improves the
source separation metric and the word error rate (WER) by more than 13% and 50%
relative, respectively, over a reference system with conventional features.
Applying dereverberation as pre-processing to the proposed system can further
reduce the WER by 29% relative using an acoustic model trained on clean and
reverberated data.
Comment: Presented at IEEE ICASSP 202
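The "relative" improvements quoted above (13%, 50%, 29%) follow the standard relative-reduction formula, which is worth making explicit since absolute and relative WER gains are easily confused. The WER values below are hypothetical, chosen only to illustrate the arithmetic:

```python
def relative_reduction(baseline, improved):
    """Relative reduction in percent: how much 'improved' lowers 'baseline'.
    E.g. WER going from 40% to 20% is a 50% *relative* reduction,
    but only a 20-point *absolute* reduction."""
    return 100.0 * (baseline - improved) / baseline

# Hypothetical WERs, not results from the paper:
print(relative_reduction(40.0, 20.0))   # 50.0 -> "50% relative" WER reduction
print(relative_reduction(40.0, 28.4))   # 29.0 -> "29% relative" WER reduction
```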
On monoaural speech enhancement for automatic recognition of real noisy speech using mixture invariant training
In this paper, we explore an improved framework to train a monoaural neural
enhancement model for robust speech recognition. The designed training
framework extends the existing mixture invariant training criterion to exploit
both unpaired clean speech and real noisy data. The unpaired clean speech is
found to be crucial for improving the quality of speech separated from real
noisy recordings. The proposed method also remixes processed and unprocessed
signals to alleviate processing artifacts. Experiments on the single-channel
CHiME-3 real test sets show that the proposed method significantly improves
speech recognition performance over enhancement systems trained either on
mismatched simulated data in a supervised fashion or on matched real data in
an unsupervised fashion. The proposed system achieves between 16% and 39%
relative WER reduction over the unprocessed signal using end-to-end and hybrid
acoustic models, without retraining on distorted data.
Comment: Accepted to INTERSPEECH 202
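The mixture invariant training (MixIT) criterion that this framework extends can be sketched in a few lines: the model separates a mixture of mixtures into several estimated sources, and the loss is computed under the best assignment of those sources back to the two input mixtures. This is a minimal toy version (exhaustive assignment search over tiny signals), not the paper's training code, and it omits the unpaired-clean-speech and remixing extensions the paper adds:

```python
import itertools
import numpy as np

def best_mixit_assignment(est_sources, mix1, mix2):
    """Toy MixIT step: try every binary assignment of estimated sources to the
    two input mixtures and return the lowest-MSE loss and its assignment.
    In training, gradients would flow through this best-assignment loss."""
    best_loss, best_mask = None, None
    for mask in itertools.product([0, 1], repeat=len(est_sources)):
        s1 = sum(s for s, a in zip(est_sources, mask) if a == 0)
        s2 = sum(s for s, a in zip(est_sources, mask) if a == 1)
        loss = float(np.mean((s1 - mix1) ** 2) + np.mean((s2 - mix2) ** 2))
        if best_loss is None or loss < best_loss:
            best_loss, best_mask = loss, mask
    return best_loss, best_mask

rng = np.random.default_rng(0)
a, b = rng.normal(size=16), rng.normal(size=16)   # toy "separated sources"
loss, mask = best_mixit_assignment([a, b], mix1=a, mix2=b)
print(loss, mask)   # ideal separation -> loss 0.0 under assignment (0, 1)
```

The key property is that no clean source references are needed: supervision comes entirely from how the estimated sources re-sum to the observed mixtures.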
Learning Noise Invariant Features Through Transfer Learning for Robust End-to-End Speech Recognition
A Teacher-Student approach for extracting informative speaker embeddings from speech mixtures
We introduce a monaural neural speaker embeddings extractor that computes an
embedding for each speaker present in a speech mixture. To allow for supervised
training, a teacher-student approach is employed: the teacher computes the
target embeddings from each speaker's utterance before the utterances are added
to form the mixture, and the student embedding extractor is then tasked to
reproduce those embeddings from the speech mixture at its input. The system
verifies the presence or absence of a given speaker in a mixture much more
reliably than a conventional speaker embedding extractor, and even achieves
performance comparable to a multi-channel approach that exploits spatial
information for embedding extraction. Further, it is shown that a speaker
embedding computed from one mixture can be used to check for the presence of
that speaker in another mixture.
Comment: Accepted for Interspeech 202
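The teacher-student setup described above can be sketched as follows. Both the teacher pooling and the loss are stand-ins chosen for illustration (in the paper the teacher is a pretrained speaker embedding extractor and the student is a neural network); the point is only the supervision structure: teacher targets come from the pre-mix single-speaker signals, while the student sees only the mixture:

```python
import numpy as np

def teacher_embed(utterance):
    """Stand-in teacher: embedding of a single-speaker utterance.
    (A pretrained speaker embedding extractor in the actual system.)"""
    return utterance.mean(axis=0)   # toy temporal pooling

def student_loss(student_embeds, teacher_embeds):
    """Student training target: reproduce the per-speaker teacher embeddings
    from the mixture input; here a simple MSE over all speakers."""
    return float(np.mean((student_embeds - teacher_embeds) ** 2))

rng = np.random.default_rng(0)
utt_a = rng.normal(size=(100, 8))   # single-speaker "utterances" (frames x dims)
utt_b = rng.normal(size=(100, 8))
mixture = utt_a + utt_b             # the only signal the student ever sees
targets = np.stack([teacher_embed(utt_a), teacher_embed(utt_b)])
# A perfect student would emit exactly the teacher targets:
print(student_loss(targets, targets))   # 0.0
```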
Frame-wise and overlap-robust speaker embeddings for meeting diarization
Using a Teacher-Student training approach, we developed a speaker embedding
extraction system that outputs embeddings at frame rate. Given this high
temporal resolution and the fact that the student produces sensible speaker
embeddings even for segments with speech overlap, the frame-wise embeddings
serve as an appropriate representation of the input speech signal for an
end-to-end neural meeting diarization (EEND) system. Our experiments show that
this representation helps mitigate a well-known problem of EEND systems: the
drop in diarization performance as the number of speakers increases is
significantly reduced. We also introduce block-wise processing to enable
diarization of arbitrarily long meetings.
Comment: ICASSP 202
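Block-wise processing of long recordings amounts to splitting the frame sequence into fixed-length, possibly overlapping blocks and diarizing each block separately (with speaker identities stitched across blocks afterwards). A minimal sketch of the block segmentation, with block length and hop chosen arbitrarily for illustration:

```python
def blocks(num_frames, block_len, hop):
    """Split a long recording into overlapping (start, end) frame blocks so
    that arbitrarily long meetings can be processed block by block.
    The last block is clipped to the recording length."""
    starts = range(0, max(1, num_frames - block_len + hop), hop)
    return [(s, min(s + block_len, num_frames)) for s in starts]

print(blocks(10, block_len=4, hop=2))   # [(0, 4), (2, 6), (4, 8), (6, 10)]
print(blocks(3, block_len=4, hop=2))    # [(0, 3)] - shorter than one block
```

Overlap between consecutive blocks (hop < block_len) is what allows speaker labels to be aligned from one block to the next when the per-block outputs are stitched together.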